Native functions of short tandem repeats

Over a third of the human genome is comprised of repetitive sequences, including more than a million short tandem repeats (STRs). While studies of the pathologic consequences of repeat expansions that cause syndromic human diseases are extensive, the potential native functions of STRs are often ignored. Here, we summarize a growing body of research into the normal biological functions for repetitive elements across the genome, with a particular focus on the roles of STRs in regulating gene expression. We propose reconceptualizing the pathogenic consequences of repeat expansions as aberrancies in normal gene regulation. From this altered viewpoint, we predict that future work will reveal broader roles for STRs in neuronal function and as risk alleles for more common human neurological diseases.


Introduction
At least a third of the human genome is comprised of repetitive sequences (de Koning et al., 2011;Gemmell, 2021;Britten and Kohne, 1968). Some of the first genomic repetitive elements were discovered in association with disease. As a result, pathogenic roles of repeats were well studied, while potential native functions of these repeats were largely dismissed. However, the conservation of genomic repeats among different eukaryotic species (Eichler et al., 1995;Sulovari et al., 2019;Liquori et al., 2003) and their high polymorphism rates compared to other types of genetic variation suggest that repeats may have important biological functions in addition to the pathogenic ones. A growing body of research has revealed complex biological and evolutionary functions for repeats across the genome. Here, we summarize the important functions of one type of genomic repeat, short (2-6 base pair) tandem repeats (STRs), in DNA, RNA, and as proteins. We then reframe STR toxicity observed in repeat expansion disorders (REDs) as an aberrancy of native STR functions, rather than as solely an emergent property disconnected from the native repeat. Finally, we discuss how this alternative view of STR toxicity can improve our understanding of roles of STRs in neuronal function and human health.

A brief history of repetitive DNA
Repetitive elements in DNA were first discovered by Barbara McClintock, who observed the presence of 'controlling elements' randomly dispersed throughout the maize (Zea mays) genome (Comfort, 2001;Ravindran, 2012;McClintock, 1950). These interspersed repeats, which would come to be known as transposable elements (TEs), use flanking repetitive sequences to 'jump' around to different locations in the genome, often resulting in duplications of genetic material.
In contrast to interspersed repeats, tandem repeats (TRs) are regions in which repeating units lie in parallel (or in tandem) and are classified by size of the repeating unit as satellites (>60 base pairs), minisatellites (10-60 base pairs), or microsatellites (<9 base pairs). Short (2-6 base pair) tandem repeats (STRs) comprise between 1% and 3% of the human genome (Wyner et al., 2020;Lander et al., 2001). In the early 1990s, a series of STR expansions were causally linked with human diseases, including spinobulbar muscular atrophy (La Spada et al., 1992), Fragile X Syndrome (Heitz et al., 1991;Oberlé et al., 1991;Verkerk et al., 1991;Yu et al., 1991), Huntington's disease (MacDonald et al., 1993), and myotonic dystrophy (Buxton et al., 1992;Harley et al., 1992;Aslanidis et al., 1992;Fu et al., 1992;Mahadevan et al., 1992). As such, much of the research on STRs to date has centered on the mechanisms by which repeat expansions trigger neuronal toxicity. We will use the Fragile X locus as an exemplar of this now extensive body of literature, which is reviewed in more detail elsewhere (Malik et al., 2021a;Hagerman et al., 2017;Hagerman and Hagerman, 2015;Jacquemont et al., 2003;Glineburg et al., 2018), as it helps us understand how STRs might function normally in the absence of expansion.
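To make the 2-6 base pair definition concrete, the following is a minimal sketch of how perfect STR tracts could be located in a DNA string. It is an illustration only and is not the method used by any of the alignment tools cited in this review: the function name `find_strs`, the default of three copies as a minimum tract length, and the decision to ignore interrupted (imperfect) repeats are all assumptions made for the example.

```python
import re


def find_strs(sequence, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats.

    Hypothetical helper: unit sizes default to the 2-6 bp STR range given in
    the text; min_copies=3 is an arbitrary illustrative threshold.
    """
    if min_copies < 2:
        raise ValueError("a tandem repeat needs at least two copies")
    sequence = sequence.upper()
    # The lazy quantifier tries the shortest candidate unit first, so a tract
    # such as CAGCAGCAG is reported with unit CAG rather than CAGCAG.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        # Skip homopolymer runs (e.g. AAAAAA), which would otherwise be
        # reported as a dinucleotide repeat of "AA".
        if len(set(unit)) == 1:
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # Toy sequence made up for the example, not a real genomic locus.
    print(find_strs("TTGACAGCAGCAGCAGTTCA"))
```

Because the scan reports the leftmost match, the repeat unit may be returned as a rotation of the conventional name (for example GCA rather than CAG), depending on the flanking sequence.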

Fragile X-associated disorders: the discovery of pathogenic short tandem repeats
Fragile X Syndrome (FXS), the most common monogenic form of intellectual disability, was one of the first genetic diseases linked to an STR expansion (Heitz et al., 1991;Oberlé et al., 1991;Verkerk et al., 1991;Yu et al., 1991). In 1943, Julia Bell and James Purdon Martin described an X-linked intellectual disability primarily affecting people assigned male at birth, that could be inherited from a carrier female parent or affected male parent (Martin and Bell, 1943). Karyotypes of affected individuals show a folate-sensitive fragile site on the X chromosome, which causes the chromosome to bend or break at one arm (Lubs, 1969;Proops and Webb, 1981;Chudley and Hagerman, 1987;Hagerman et al., 1986). The fragile site associated with FXS is located at the Fragile X messenger ribonucleoprotein 1 (FMR1) gene, which contains a large CGG repeat in the 5' UTR of affected individuals (>200 repeats) (Heitz et al., 1991;Oberlé et al., 1991;Verkerk et al., 1991;Yu et al., 1991). In addition to intellectual disability, FXS patients commonly present with hyperactivity, anxiety, and seizures (Hagerman et al., 2017;Chudley and Hagerman, 1987;Hagerman and Hagerman, 2002a). Other chromosomal fragile sites also contain STRs, some of which are linked to other diseases (Glover, 2006;Debacker and Kooy, 2007;Schwartz et al., 2006). For example, Fragile XE syndrome (FRAXE), caused by a CGG repeat expansion in the FMR2 gene (Knight et al., 1993;Gu et al., 1996;Gecz et al., 1996), manifests in an X-linked intellectual disability similar to FXS (Mulley et al., 1995;Gecz, 2000). While studying pedigrees of Fragile X families, Stephanie Sherman and colleagues observed incomplete penetrance of mental impairment, affecting only 79% of males and 35% of females (Sherman et al., 1984;Sherman et al., 1985).
This 'Sherman paradox' suggested a generational risk factor in Fragile X mental impairment, as unaffected 'normal transmitting' males (NTMs) passed on a mutant allele to unaffected female children, with disease manifestation in affected (predominantly) male grandchildren. Subsequent studies of CGG repeat length variation found that individuals from non-Fragile X families have 6-54 CGG repeats, while some unaffected individuals in Fragile X families have 55-200 repeats, a 'premutation' associated with increased risk of further repeat expansion during oogenesis.
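The repeat-length ranges described above can be summarized as a simple lookup. The sketch below uses only the cut-offs stated in the text (up to 54 repeats in non-Fragile X families, 55-200 for the premutation, and more than 200 in individuals with FXS). The function name `classify_fmr1_allele` and the labels 'normal range' and 'full mutation' are assumptions introduced for illustration, and the cut-offs applied in clinical practice may be defined differently.

```python
def classify_fmr1_allele(cgg_repeats):
    """Assign an FMR1 CGG repeat count to the ranges given in the text.

    Hypothetical helper: the category labels are illustrative, and the
    thresholds are taken directly from the ranges quoted in this review.
    """
    if cgg_repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if cgg_repeats > 200:
        return "full mutation"   # >200 repeats, associated with FXS
    if cgg_repeats >= 55:
        return "premutation"     # 55-200 repeats
    return "normal range"        # up to 54 repeats


if __name__ == "__main__":
    for count in (30, 54, 55, 200, 201):
        print(count, classify_fmr1_allele(count))
```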
Subsequent work with Fragile X families revealed that Fragile X premutation expansion carriers often manifest clinically distinct disorders that are caused by the CGG repeat. Fragile X-associated tremor/ataxia syndrome (FXTAS) is an age-linked neurodegenerative disorder characterized by progressive intention tremor and ataxia, parkinsonism, and cognitive decline (Hagerman and Hagerman, 2015;Jacquemont et al., 2003;Hagerman and Hagerman, 2004;Hagerman et al., 2001;Hagerman and Hagerman, 2002b;Leehey et al., 2003;Brunberg et al., 2002). As an X-linked disorder, FXTAS primarily affects people assigned male at birth. People with two X chromosomes may develop FXTAS, but are also at risk for developing Fragile X-associated premature ovarian insufficiency (FXPOI), a disorder characterized by absent or irregular menstrual cycles, early onset of menopause, and fertility issues (Hagerman and Hagerman, 2002a;Allingham-Hawkins et al., 1999;Murray et al., 2000a;Murray et al., 2000b). As 'premutation disorders', FXTAS and FXPOI are thought to share similar molecular mechanisms by which the premutation CGG repeat expansion causes cytotoxicity and dysfunction.
More than 50 REDs discovered to date show common mechanisms of molecular pathology (Malik et al., 2021a;Glineburg et al., 2018;Paulson, 2018;Rodriguez and Todd, 2019). FXS, FXTAS, and FXPOI, collectively referred to as Fragile X-associated disorders, are revisited throughout this review to exemplify the mechanisms by which STRs can cause cellular dysfunction and toxicity. However, the often-stereotyped manifestations of REDs, together with the abundance of repetitive elements throughout the genome, suggest that STRs could have native functions which become aberrant in the setting of repeat expansions. We will focus most of the rest of the review on this supposition.

Native STR functions
While overshadowed by disease-centered research, scientists have investigated functional consequences of repeat polymorphisms for decades. Studies of individual or small groups of genes showed phenotypic consequences of repeat length variation on flocculation and cell adhesion in yeast (Voynov et al., 2006;Levdansky et al., 2007;Verstrepen et al., 2005), limb and skull morphology in dogs (Fondon and Garner, 2004) and on behavioral traits in voles (Hammock and Young, 2005). Recent advances in sequencing technology and STR-conscious alignment techniques now permit the detection and characterization of thousands of new STRs and their variation across the human genome, and have enabled genome-wide study of the effect of repeat length polymorphisms on gene expression (Payseur et al., 2011;McIver et al., 2011;O'Dushlaine and Shields, 2008;McIver et al., 2013;Willems et al., 2014;Mallick et al., 2016). As thousands of single-nucleotide polymorphisms (SNPs) have been linked with disease risk in Genome Wide Association Studies (GWAS; Tam et al., 2019;Uffelmann et al., 2021), ongoing studies of human genomes aim to link variation in STR length to phenotypic outcomes (Fotsing et al., 2019;Mitra et al., 2021). In homage to the expression quantitative trait loci (eQTL) identified in traditional GWAS (Tam et al., 2019;Uffelmann et al., 2021), STRs associated with differences in expression of nearby genes are called eSTRs. In the following sections, we will first discuss evidence for evolutionary constraint on STRs linked to evolution across phylogeny and in humans. We will then showcase the mechanisms by which variation in STR length affects gene expression.
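As a rough illustration of what an eSTR association means in practice, the sketch below regresses the expression of a gene on the repeat length of a nearby STR across individuals and reports the least-squares slope and the Pearson correlation. This is a simplified stand-in and not the pipeline of the cited studies, which additionally handle covariates, population structure, and multiple testing. The function name `str_expression_association` and the toy values are hypothetical and are not data from any study.

```python
def str_expression_association(repeat_lengths, expression):
    """Return (slope, pearson_r) of expression regressed on repeat length.

    Hypothetical helper: a plain least-squares fit with no covariates,
    intended only to show the shape of an eSTR-style test.
    """
    n = len(repeat_lengths)
    if n != len(expression) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(repeat_lengths) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in repeat_lengths)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(repeat_lengths, expression)
    )
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in repeat length or expression")
    slope = sxy / sxx
    pearson_r = sxy / (sxx * syy) ** 0.5
    return slope, pearson_r


if __name__ == "__main__":
    # Made-up values: mean repeat length per individual and normalized
    # expression of a neighboring gene.
    lengths = [10, 10, 11, 12, 12, 13, 14, 15]
    expr = [1.0, 1.2, 1.3, 1.5, 1.4, 1.8, 1.9, 2.1]
    slope, r = str_expression_association(lengths, expr)
    print(f"slope={slope:.3f}, r={r:.3f}")
```

A positive slope in this toy setting would correspond to an STR whose longer alleles are associated with higher expression; a negative slope would correspond to the opposite relationship.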

Repetitive DNA regulates transcription
Repetitive elements can impact the transcription of neighboring genes or the genes in which they reside by regulating chromatin structure and epigenetic markers. A role for repetitive DNA in facilitating 3D folding of the genome was first observed with TE-dependent formation of chromatin loops across multiple species, including yeast, Drosophila, and mammals (Figure 1 (i); Cournac et al., 2016;Bourque, 2009;Lu et al., 2021). Contact maps generated using chromosome conformation capture (Hi-C) show high co-localization of repetitive elements in nuclear space in humans, mice, and Drosophila, demonstrating a structural function of repetitive DNA (Cournac et al., 2016). The enrichment of transcription factor binding sites in proximity to spatially associated repeats suggests that repeat-mediated 3D DNA packaging may allow for context-dependent co-transcription of linearly remote genes (Cournac et al., 2016).
STRs play critical roles in maintaining chromatin structure (Nikumbh and Pfeifer, 2017;Sun et al., 2018;Volle and Delaney, 2012). For example, short CAG/CTG tracts avidly incorporate nucleosomes (Figure 1 (ii)), which are a basic subunit of chromatin packaging (Volle and Delaney, 2012). Nucleosome position varies with differences in STR length and flanking sequence context (Volle and Delaney, 2012), influencing chromatin structure and transcription of nearby genes. Other STRs, including CGG repeats, have the opposite effect and exclude nucleosomes in their native states, creating more open chromatin states near the transcription start sites of genes that favor local transcription (Wang et al., 1996;Wang, 2007). This feature may underlie the enrichment of CGG repeats in promoters and 5'UTRs (Uesaka et al., 2014).
STRs can also influence chromatin structure by modulating DNA methylation. Some STRs are prone to methylation, which can lead to gene silencing and the absence of transcription (Quilez et al., 2016;Garg et al., 2021;Pappalardo and Barra, 2021). One common example of repeat-mediated gene silencing, CpG islands are repeating di-nucleotide CpG sequences ranging from around 500-3000 base pairs and are located in ~40% of gene promoters across mammalian genomes (Deaton and Bird, 2011;Janitz and Janitz, 2011;Thomson et al., 2010;Clouaire et al., 2012;Blackledge et al., 2013). STRs are commonly located near CpG islands (Sun et al., 2018), and may influence their methylation states (Figure 1 (iii)). Moreover, other STRs can contain CpGs within their repetitive sequence that can undergo methylation (Bolton et al., 2013).
A genome-wide study in yeast (Vinces et al., 2009) estimates as many as 25% of promoters contain tandem repeats (TRs). Generally, expression of genes with TRs in their promoters increased with increasing repeat size. TRs in promoters may increase gene expression by increasing transcription factor binding (Figure 1 (iv)), blocking or reducing nucleosome density, or in the case of AT-rich repeats, by facilitating DNA melting (Vinces et al., 2009). However, stable secondary structures formed by TRs can also inhibit transcription by impeding progression and access of transcriptional machinery (Figure 1 (v); Grabczyk and Fishman, 1995;Belotserkovskii et al., 2010;Usdin and Woodford, 1995). For example, the evolutionarily conserved THO complex is recruited to actively transcribed genes (Kim et al., 2004;Abruzzi et al., 2004;Strässer et al., 2002), and facilitates elongation of RNA polymerase (Fan et al., 1996;Prado et al., 1997;Fan et al., 2001;Jimeno et al., 2002;Chávez et al., 2000) through super-helical structures formed by long GC-rich TRs (Voynov et al., 2006;Chávez et al., 2001). Yeast strains with mutations to THO complex subunits exhibited lower levels of TR-containing FLO11 mRNA. Reduced FLO11 mRNA coincided with an accumulation of RNA polymerase at the beginning of the gene. Removal of the TR or overexpression of topoisomerase I to enhance unwinding of the structured DNA rescued the reduction in FLO11 mRNA in THO complex mutants (Voynov et al., 2006).

Figure 1. Native functions of genomic repeats. Repeats in DNA influence larger 3D chromatin structures, regulate binding of nucleosomes and (de)acetylases and (de)methylases. They also influence transcription factor binding and polymerase processivity to affect downstream RNA production. Repeats in RNA can affect pre-mRNA processing such as alternative splicing and can affect RNA binding protein function through direct or indirect sequestration. Repeats in 3' UTRs serve as localization signals, directing mRNA transport. Repeats in 5' UTRs regulate translational output by impeding ribosome processivity. Repeating units in proteins can provide structural flexibility within a protein or serve as binding sites for the formation of multiprotein complexes.
As in yeast, human STRs can either enhance or inhibit transcription of associated genes depending on their sequence and location, and can also affect gene expression via changes in gene methylation and chromatin structure (Fotsing et al., 2019;Quilez et al., 2016;Garg et al., 2021;Jakubosky et al., 2020). Together, these studies demonstrate numerous mechanisms by which TRs can enhance or inhibit gene expression.

STRs in RNA regulate pre-mRNA processing and RNA localization
Transcribed repetitive elements regulate numerous aspects of RNA biology. STRs in RNAs form complex higher order structures, including G-quadruplexes and hairpins (Krzyzosiak et al., 2012;Sobczak et al., 2003;Malgowska et al., 2014;Sobczak et al., 2010), which are thought to exert broad influence over pre-mRNA splicing (Figure 1 (vi)) (Muro et al., 1999;Tu et al., 2000;Black, 2003;Solnick and Lee, 1987). An analysis of human introns found that sites of alternative splicing are enriched for STRs (Lian and Garner, 2005). STRs can facilitate alternative splicing by complementary pairing of intronic repeats, bringing exonic regions into close proximity (Lian and Garner, 2005). Structure-forming STRs can inhibit or enhance alternative splicing by blocking or facilitating the recruitment of splicing factors, respectively (Lian and Garner, 2005). For example, alternative splicing of the EIIIB exon in the well-conserved fibronectin gene is regulated by an intronic TGCATG repeat (Huh and Hynes, 1994;Lim and Sharp, 1998). Contractions in this STR reduce EIIIB exon inclusion, while overexpression of a specific splicing factor, SRp40, stimulates inclusion. While the TGCATG repeat differs from SRp40's consensus binding site, it can form a strong hairpin structure, which is a key feature of SRp40 binding site motifs (Tacke et al., 1997). This suggests that the TGCATG repeat may modulate alternative splicing by recruiting the SRp40 splicing factor to the intron/exon boundary (Lim and Sharp, 1998).
Some STRs in RNA regulate splicing in trans, by binding to and sequestering splicing factors and blocking their functions (Figure 1 (vii)). A recent study identified a group of novel long non-coding RNAs (lncRNAs) with multiple predicted RNA binding motifs (Yap et al., 2018), a subset of which contained long stretches of STRs ('strRNAs'). One strRNA called the pyrimidine-rich noncoding transcript (PNCTR) contains numerous stretches of (TC)n repeats, avidly binds to the polypyrimidine tract-binding protein (PTBP1) in cells, and negatively regulates PTBP1-mediated splicing (Yap et al., 2018). As such, PNCTR overexpression was sufficient to trigger mis-splicing of PTBP1 targets and trigger programmed cell death (Yap et al., 2018). In this way, STRs in RNA can regulate the global availability of other RNA-binding proteins (RBPs) with other functions, exerting profound control over numerous aspects of cell biology.
STRs in 3' UTRs can also serve as RNA localization signals, and via interactions with RBPs, facilitate the transport of RNAs to specified cellular compartments (Figure 1 (viii)). A program called REPFIND was developed to analyze 3' UTRs of localized mRNAs in Xenopus oocytes and identified various CAC-containing repeat motifs that serve as localization elements (Betley et al., 2002). Mutating these CAC-containing repeats was sufficient to abolish normal RNA localization. CAC-containing repeats were also found in zebrafish and human 3' UTRs of transcripts that are known to be specifically localized within cells, suggesting that CAC-containing repeats are conserved localization elements in chordates (Betley et al., 2002). REPFIND was subsequently used to generate a database of repeating motifs in 3' UTRs of mammalian genes from the Mammalian Gene Collection (MGC) that revealed hundreds of human genes containing short CAC- and CAG-rich repeats in their 3' UTRs (Lim and Sharp, 1998). Intriguingly, these elements facilitate RNA localization to neurites in rat hippocampal neurons (Andken et al., 2007).
Indeed, libraries of synthetic (Millette et al., 2022) and naturally occurring (Li et al., 2017;Niederer et al., 2022) hairpin sequences placed within 5' UTRs can be used to precisely control translational transgene output, with potential implications for gene therapy dosing. These studies show how single unit variations in STRs can precisely modulate protein expression, generally permitting more and faster translation of mRNAs with smaller STRs, and less and slower translation of mRNAs with larger STRs.

Repeats in proteins facilitate multi-protein complex formation and structural flexibility
Eukaryotic proteins are more likely to have repeats than prokaryotic proteins, and proteins containing repeats are often unique to eukaryotes and eukaryotic functions (Marcotte et al., 1999). There are numerous long repeating motifs in proteins (>20 amino acids/repeat) with loose homology between repeats, that form complex tertiary structures (Andrade et al., 2001). These protein repeat domains are characterized by the structures they form, as all-β (e.g. β-propellers, β-trefoils), all-α (e.g. HEAT and tetratricopeptide repeats (TPRs)), or mixed α/β (e.g. leucine-rich repeats, ankyrin repeats; Andrade et al., 2001). Although their specific functions vary, protein repeat domains typically serve as binding sites, and are thought to have evolved in eukaryotes to aid in the formation of multi-protein complexes with advanced cellular functions (Andrade et al., 2001;Kajava, 2012;Sharma and Pandey, 2015).
STRs translated into proteins are thought to have functions similar to those of these larger repeat-based protein domains, serving as sites for protein-protein interactions (Figure 1 (x); Schaefer et al., 2012;Faux, 2012). CAG repeats are enriched in coding regions and are most frequently found in the polyglutamine (polyQ) reading frame, suggesting that polyQ stretches in proteins have a native function (Schaefer et al., 2012). PolyQ stretches are enriched in proteins that are components of multi-protein complexes, and have functions in transcriptional control, phosphatidylinositol (PI) signaling, protein degradation, and chromatin remodeling. Evolutionary sequence comparison reveals that the location of polyQs within a protein is not always conserved (Schaefer et al., 2012). This suggests that polyQ stretches have evolved multiple times, and do not directly confer a protein's function, but rather modulate the protein-protein interactions necessary for those functions (Schaefer et al., 2012;Orr, 2012). Other CG-containing STRs (i.e. CUGs and CGGs) show similar patterns of overrepresentation in coding regions (Schaefer et al., 2012), and likely serve similar complex-scaffolding functions (Nasrallah et al., 2012).
STRs when translated into proteins can be critical for proper protein folding. For example, translation of a CAG repeat in the huntingtin gene (HTT) produces a polyQ tract in the HTT protein which serves as a flexible hinge, allowing the neighboring domains to fold into close proximity (Figure 1 (xi)) (Caron et al., 2013). HTT protein structure is altered with repeat expansion, demonstrating the importance of the flexibility conferred by this STR (Caron et al., 2013).

Pathogenic consequences of STRs: A Fragile X case study
In the previous section, we summarized how STRs in DNA, RNA, and when translated into proteins can affect gene expression and protein function. In the following section, we will draw parallels from these native functions of STRs to pathogenic mechanisms in REDs (Figure 2). These parallels demonstrate how STR toxicity can be viewed as aberrancies of native processes, rather than emergent dysfunctions. For this analysis, we will largely use the Fragile X locus discussed earlier as a well-characterized case study, although many of these principles also apply to other REDs and a few specific examples are included here (reviewed in broader detail in Malik et al., 2021a;Glineburg et al., 2018;Paulson, 2018;Rodriguez and Todd, 2019).

Figure 2. STR-associated toxicity in Repeat Expansion Disorders. Repeat expansions can alter global 3D chromatin structure, and influence transcription via blocking or enhancing binding of nucleosomes, (de)acetylases, (de)methylases, and transcription factors. Expanded repeats may also impede polymerase processivity. In some cases, elevated transcription of repeat expansion RNA can lead to depletion of RNA-binding proteins. Depletion of these proteins can impact many processes to which they contribute, including pre-mRNA splicing and processing, and mRNA localization. Expanded repeat RNA and bound RBPs can also aggregate into RNA foci, causing toxicity. Expanded repeat RNA can stall translational complexes, leading to repeat-associated non-AUG (RAN) translation, and contribute to the production of polymeric proteins. Polymeric proteins are aggregate prone. Longer polymeric stretches in native proteins may also cause dysfunction by preventing proper protein folding or causing the folded protein to mis-localize within the cell.

Epigenetic and transcriptional dysfunction of STRs in DNA
The functional consequences of STRs on genome organization and transcription are evident when dysfunction is observed in REDs (Dion and Wilson, 2009;López-Martínez et al., 2020;Yin et al., 2020;Usdin, 2008). Repeat expansions can alter local genome architecture and expression of neighboring genes. A prime example is observed at a CTG repeat in the 3'UTR of the DMPK gene associated with myotonic dystrophy type 1 (DM1), expansion of which alters local chromatin structure and suppresses transcription of the neighboring gene, Six5 (Winchester et al., 1999;Brouwer et al., 2013;López Castel et al., 2011). Repeat expansions also cause global alterations in chromatin structure. CGG repeat expansions in FXS patients cause severe disruptions in chromatin boundaries (Figure 2 (i); Sun et al., 2018). These disruptions may explain delayed DNA replication (Subramanian et al., 1996), activation of DNA replication stress pathways (Chakraborty et al., 2020) and altered local DNA replication patterns observed at CGG repeat expansions and the Fragile X locus in particular.
As genomic repeats influence native DNA methylation, some STRs are aberrantly methylated only upon expansion (Figure 2 (ii); Otten and Tapscott, 1995;Steinbach et al., 1998;Herman et al., 2006;Greene et al., 2007;Belzil et al., 2013;Xi et al., 2013). When the CGG repeat in the 5' UTR of FMR1 expands beyond 200 repeats, it is susceptible to DNA methylation of both the CpG elements within the repeat and at a CpG element within the FMR1 promoter (Oberlé et al., 1991;Sutcliffe et al., 1992;Pieretti et al., 1991;McConkie-Rosell et al., 1993;Hansen et al., 1992;Coffee et al., 2002;Colak et al., 2014;Willemsen et al., 2002). This hypermethylation is associated with FMR1 gene silencing, with a resulting absence of FMR1 mRNA and FMRP, a critical RBP involved in synaptic plasticity and neuronal function (Oberlé et al., 1991;Hagerman et al., 2017;Quartier et al., 2017;Myrick et al., 2014). How exactly repeat expansion triggers methylation, and the relationship between expansion, methylation, and epigenetic silencing, are not fully understood, but the locus remains transcriptionally active and unmethylated in human embryonic stem cells even in the presence of very large repeat expansions, with silencing occurring during differentiation. Some studies suggest that FMR1 silencing requires co-transcriptional binding of CGG repeat mRNA directly to the FMR1 promoter region as an RNA-DNA heteroduplex (Groh et al., 2014).
STR expansions can enhance or inhibit mRNA production from nearby genes. At FMR1, premutation range CGG repeats which cause FXTAS or FXPOI (and which are unmethylated) result in elevated transcription of FMR1 mRNA (Tassone et al., 2000a;Tassone et al., 2000b;Entezam et al., 2007;Brouwer et al., 2008;Kenneson et al., 2001). This may result from use of additional upstream transcription start sites (Beilina et al., 2004;Tassone et al., 2011), or be associated with enrichment of acetylated histones or other chromatin activating factors at the premutation allele (Todd et al., 2010). It is possible that both hypo-expression and hyperexpression of FMR1 stem from the complex structures formed by these CGG repeats as DNA (Usdin and Woodford, 1995;Fry and Loeb, 1994;Kettani et al., 1995;Patel et al., 2000). As seen in native STRs, different structures formed by expanded STRs could facilitate or block binding of histone-modifying methylases, demethylases, acetylases, deacetylases, and even entire nucleosomes to affect downstream gene expression (Figure 2 (iii-v)) (Wang et al., 1996;Usdin and Kumari, 2015).

Repeat expansions cause defects in pre-mRNA processing and mRNA localization
The native roles of STRs in RNA in regulating splicing mirror splicing dysfunction observed in numerous REDs (Figure 2 (vi)). Splicing of the huntingtin (HTT) gene, which contains a CAG repeat, is altered at expanded repeats associated with Huntington's Disease, resulting in the production of a transcript containing only exon 1 and the production of an exon 1 HTT protein (Sathasivam et al., 2013;Neueder et al., 2017;Neueder et al., 2018;Franich et al., 2019). The exon 1 HTT protein is found in patient tissues and is toxic in model systems (Sathasivam et al., 2013;Neueder et al., 2017;Neueder et al., 2018;Franich et al., 2019). Incomplete splicing of HTT with the CAG repeat expansion increased with overexpression and decreased with knockdown of splicing factor SRSF6. SRSF6 is predicted to bind to the 5' end of HTT transcripts via the CAG repeat, suggesting that SRSF6-CAG repeat interactions interfere with spliceosome formation at the nearby splice site (Neueder et al., 2018).
In addition to the depletion of RBPs and consequent defects in RNA splicing and localization and miRNA processing, expanded STRs in RNAs may also cause toxicity by self-association (gelation) (Figure 2 (ix); Glineburg et al., 2018;Sellier et al., 2010;He et al., 2014;Jain and Vale, 2017;Ciesiolka et al., 2017;Fay et al., 2017;Tassone et al., 2004). Yet, these processes also occur on RNAs with shorter STRs that are below the pathological threshold for disease, suggesting that such phase separation properties of specific RNA motifs and their associated RBPs may exist on a spectrum from physiologic to pathologic.
Expanded STRs in RNA can mis-localize or be retained in the nucleus instead of being transported to their functional location in the cell (Davis et al., 1997;Mastroyiannopoulos et al., 2010;Sun et al., 2015). This may be mediated by splicing defects, via export-inhibiting RBP interactions, or via a larger dysfunction of nucleocytoplasmic transport (Zhang et al., 2016;Jovičić et al., 2015;Freibaum et al., 2015;Grima et al., 2017;Gasset-Rosa et al., 2017;Sellier et al., 2017). For example, SRSF proteins bind to CGG and G4C2 repeats and appear critical to their cytoplasmic transport out of the nucleus (Malik et al., 2021b;Hautbergue et al., 2017). In this context, lowered expression of SRSF proteins or inhibition of the SRSF protein kinase SRPK1, which regulates SRSF nuclear entry, suppress CGG repeat exit to the cytoplasm and reduce toxicity in Drosophila and neuronal model systems (Malik et al., 2021b). Together, these studies show that expanded STRs in RNA can induce toxicity via RBP depletion or by direct RNA dysfunction.

Repetitive proteins have pathogenic consequences
Repeat-containing peptides, produced via canonical translation of STRs in coding regions or via RAN translation, contribute to toxicity in REDs. At CGG repeats, both FMRpolyG and FMRpolyA are present within intranuclear neuronal inclusions in patient tissues (Todd et al., 2013;Buijsen et al., 2014;Krans et al., 2019;Ma et al., 2019), and are toxic in model systems (Todd et al., 2013;Derbis et al., 2018;Gohel et al., 2019;Hoem et al., 2019). FMRpolyG, the most abundant CGG RAN product, is necessary for CGG repeat toxicity and inclusion formation (Todd et al., 2013;Oh et al., 2015) in overexpression models. Numerous RAN or homopolymeric peptides generated in other REDs are essential for their toxicity and formation of proteinaceous inclusions (Figure 2 (xii); Yamamoto et al., 2000;Schilling et al., 1999;Ordway et al., 1997;Bäuerlein et al., 2017;Paulson et al., 1997;Mizielinska et al., 2014;May et al., 2014;Zu et al., 2013). Overall, dysfunctional aggregation of repeat-derived protein products mirrors the native function of STRs in proteins as facilitators of protein-protein interactions.
Translation through large STRs that form stable secondary structures likely induces ribosome stalls and elongation errors. A growing body of work shows that disease-associated STRs undergo stall-induced translational frameshifting to produce novel chimeric polypeptides (Gaspar et al., 2000;Toulouse et al., 2005;Davies and Rubinsztein, 2006;Tabet et al., 2018;McEachin et al., 2020;Wright et al., 2022), and several of these studies have shown that these frameshift products have distinct contributions to neuronal dysfunction in disease (Tabet et al., 2018;McEachin et al., 2020;Wright et al., 2022). While there is evidence that polymeric peptides contribute to toxicity observed in REDs via aggregation, the mechanistic details of homo- and di-polymeric peptide toxicity and chimeric polypeptide toxicity remain under investigation.

Antisense transcripts contribute to REDs via multiple mechanisms
Antisense transcription from the FMR1 locus generates multiple long-noncoding asFMR1 mRNAs, with some including the repeat (Ladd et al., 2007;Khalil et al., 2008;Elizur et al., 2016;Pastori et al., 2014). One antisense transcript, FMR4, is thought to play a critical role in regulating the cell cycle and apoptosis (Khalil et al., 2008). Another antisense transcript, FMR6, is upregulated in premutation women, with increased repeat length correlating to elevated RNA levels and reduced number of oocytes, suggesting a relationship between antisense transcript expression and toxicity (Elizur et al., 2016). FMR1 antisense transcription in general is upregulated in Fragile X premutation disorders and lost in FXS, like the sense FMR1 mRNA (Ladd et al., 2007). Moreover, asFMR1 mRNAs containing the CCG repeats can undergo RAN translation, producing additional homopolymeric proteins with toxic potential (Kearse et al., 2016). STR-containing antisense transcripts likely contribute to toxicity observed in many REDs, but this is best characterized in C9ALS/FTD and SCA8, where antisense transcripts are found in toxic RNA foci and contribute to RAN peptide production (Mori et al., 2013a;Zu et al., 2011;Zu et al., 2013;Moseley et al., 2006;Gendron et al., 2013).

Mechanisms of STR toxicity reveal novel native functions of STRs
Studies over the past three decades have delineated numerous mechanisms by which repeat expansions trigger cellular toxicity. Yet, there are striking parallels between the pathologic drivers of dysfunction elicited by repeat expansions and the native functions of STRs in regulating gene expression.
In this section, we provide examples of how mechanisms initially identified as causing STR toxicity directly inform our understanding of native functions of STRs more broadly. We also discuss how emergent pathogenic properties associated with repeat expansions might inform additional native functions of repeats that are not yet well understood.

RAN translation occurs at native repeat lengths and has native functions
While CGG repeats in the FMR1 gene were primarily studied for their disease association, the CGG repeat is present in all humans at nonpathogenic lengths (<55 repeats) and conserved across mammals (Eichler et al., 1995;Sellier et al., 2017). Some studies suggest phenotypes associated with low CGG repeat numbers at this allele in humans, including memory difficulties and language dysfluency (Klusek et al., 2018;Mailick et al., 2014). Our group observed that CGG RAN translation, originally thought to be an aberrant toxic event, occurs in reporters with native repeat lengths (25 repeats) (Kearse et al., 2016), suggesting CGG repeats and/or translation of those repeats may have a native function in addition to the pathogenic one. CGG RAN translation at native and expanded STRs acts as an overlapping upstream open reading frame (uORF), inhibiting translation of the downstream main ORF (mORF) and thereby reducing FMRP synthesis (Rodriguez et al., 2020). Furthermore, this RAN uORF-like regulation of FMRP synthesis was critical for facilitating translational changes associated with stimulation of metabotropic glutamate receptors (mGluRs) in neurons (Rodriguez et al., 2020).
Upstream open-reading frames (uORFs) are well-characterized regulatory elements in eukaryotes that influence expression of protein produced from the main open reading frame (mORF) on the same transcript, and are typically inhibitory to downstream mORF translation (Hinnebusch et al., 2016). In this way, uORFs resulting from RAN translation of STRs may play a global role in regulating mRNA translation, presenting another mechanism by which STRs influence gene expression.
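The repeat-length ranges referred to in this review can be made concrete with a small sketch. The function below is purely illustrative: the function name and category labels are assumptions introduced here, the cut-offs follow the values quoted in the text (<55 repeats as nonpathogenic, >200 repeats as the methylated full-mutation range), and the intermediate 55-200 range is treated as the premutation range. It is not a clinical classifier.

```python
def classify_fmr1_cgg(repeat_count: int) -> str:
    """Assign a coarse category to an FMR1 CGG repeat count.

    Thresholds are assumptions taken from the ranges quoted in the text:
    <55 nonpathogenic, 55-200 premutation, >200 full mutation.
    """
    if repeat_count < 0:
        raise ValueError("repeat count cannot be negative")
    if repeat_count < 55:
        return "normal"
    if repeat_count <= 200:
        return "premutation"
    return "full mutation"


# Illustrative repeat counts only (not patient data).
for n in (25, 54, 55, 200, 201):
    print(n, classify_fmr1_cgg(n))
```

Note that RAN translation was observed at 25 repeats, which falls in the "normal" category here; the categories describe disease association, not whether the repeat is translated.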

STRs facilitate protein function and localization
Expanded STRs in coding regions can fundamentally change the functions of the proteins within which they reside. In spinocerebellar ataxia type 1 (SCA1), a CAG repeat expansion in the ataxin 1 (ATXN1) gene changes the localization of ATXN1 protein (Irwin et al., 2005). ATXN1 normally shuttles between the nucleus and the cytoplasm, but an expanded polyQ region shifts ATXN1 localization to the nucleus (Figure 2(xiii); Irwin et al., 2005). Aberrant nuclear localization of ATXN1 underlies dysfunction in SCA1 (Lam et al., 2006;Lai et al., 2011;Klement et al., 1998;Emamian et al., 2003;Duvick et al., 2010), as modifications that favor nuclear localization are sufficient to elicit disease-relevant phenotypes in the absence of the repeat expansion in mouse models.
PolyQ-associated nuclear translocation is also central to pathology in spinal and bulbar muscular atrophy (SBMA), where ligand binding and translocation to the nucleus of the expanded polyQ-containing androgen receptor is required to elicit disease-associated transcriptional defects and cytotoxicity (Katsuno et al., 2006;Katsuno et al., 2002;Montie et al., 2009;Palazzolo et al., 2007). However, within the normal range of polyQ lengths observed in humans, androgen receptor CAG repeat size inversely correlates with the receptor's transactivational activity and linearly correlates with infertility and decreased sperm function (Choong and Wilson, 1998;Osadchuk et al., 2022;Pan et al., 2016). These findings suggest that the CAG repeats play a normal role in testosterone-activated gene cascades that become aberrant at larger repeat sizes.

STRs facilitate mRNA transport to dendrites
An investigation into dendritic mRNA localization identified a localization pathway dependent on the interaction of a CGG repeat-interacting RBP, hnRNP A2, with a GA dendritic targeting element of an RNA (Muslimov et al., 2011). This GA targeting motif was competed for by CGG repeat-containing RNAs, including FMR1 mRNA. In addition to a native function of CGG repeats as a dendritic localization factor, this study revealed that elevated levels of CGG repeat mRNA caused by the CGG premutation expansion sequester hnRNP A2, resulting in global dysfunction in the transport of hnRNP A2-target mRNAs (Muslimov et al., 2011). Another study seeking to reveal transcriptome-wide impacts of C(C)UG repeat-mediated MBNL depletion on splicing in myotonic dystrophy (DM) also uncovered a global role for MBNL in mRNA localization (Wang et al., 2012).

PolyQ containing proteins regulate autophagy
Numerous REDs are caused by CAG repeat expansions, including those in the huntingtin gene in Huntington's disease (HD) and in ataxin 3 in spinocerebellar ataxia type 3 (SCA3), with toxicity largely attributed to the aggregation of long polyQ-containing proteins. Autophagy induction results in clearance of these aggregates, attenuating their toxicity (Rubinsztein, 2006;Ravikumar et al., 2004;Menzies et al., 2010). PolyQ tracts in ataxin 3, a deubiquitinase, interact with beclin 1, a key initiator of autophagy (Ashkenazi et al., 2017). Ataxin 3 then deubiquitinates beclin 1, protecting it from degradation and permitting autophagy initiation. Ataxin 3 activity and interaction with beclin 1 are competitively inhibited by other polyQ tract-containing proteins in a length-dependent manner (Ashkenazi et al., 2017). As such, polyQ tracts may actively engage protein quality control pathways basally, but these interactions become aberrant after STR expansion, in this case inhibiting autophagy and clearance of toxic proteins. Together, these studies suggest that the pathology of disease-associated STR expansions reveals native functions of STRs, just as an improved understanding of the native functions of STRs can inform on dysfunctions in disease.

Tetranucleotide, pentanucleotide, and biallelic repeat expansion disorders
Tetranucleotide and pentanucleotide STRs are rare within coding sequences, presumably because changes in their repeat number would trigger translational frameshifts. However, they are relatively common within introns, where their expansion causes several neurological disorders that likely act through pathogenic mechanisms that are similar to those exhibited by non-coding trinucleotide STRs. For example, myotonic dystrophy type 2 (DM2) results from a dominantly inherited intronic CCTG STR expansion in ZNF9 (Liquori et al., 2001). CCTG STRs form RNA secondary structures that are like those generated by CTG STRs, and like the 3' UTR CTG repeat in DM1, the DM2 repeat binds to and sequesters the RBP muscleblind (Botta et al., 2008;Paul et al., 2011;Fardaei et al., 2001;Mankodi et al., 2001;Miller et al., 2000;Du et al., 2010;Philips et al., 1998). This shared mechanism explains the significant overlap in their clinical phenotypes. Perhaps more interesting, however, is how subtle differences between these repeats underlie the phenotypic differences between these conditions. In particular, CCTG expansions in DM2 do not trigger genetic anticipation or congenital forms of disease as occurs in DM1, despite the presence of very large CCTG expansions in DM2. These phenotypic differences are thought to occur for two reasons. First, these repeats exhibit differences in how they interact with other RBPs, such as rbFOX, that modulate the effects of muscleblind sequestration. Second, differences in the genic positioning (intron versus 3' UTR) and temporal expression of the two STRs alter their relative abilities to disrupt early developmental processes (Thomas et al., 2017;Cerro-Herreros et al., 2017).
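The frameshift argument for why tetra- and pentanucleotide repeats are rare in coding sequence comes down to simple modular arithmetic, sketched below. The function names are hypothetical and introduced only for illustration; the sketch assumes a standard three-nucleotide codon and considers only the net number of nucleotides gained or lost.

```python
CODON_LENGTH = 3  # nucleotides per codon


def frame_offset(unit_length: int, delta_units: int) -> int:
    """Reading-frame offset (0, 1, or 2) after gaining or losing repeat units.

    unit_length: size of the repeat motif in nucleotides (e.g. 3, 4, 5).
    delta_units: change in repeat number (negative for a contraction).
    """
    return (unit_length * delta_units) % CODON_LENGTH


def preserves_frame(unit_length: int, delta_units: int) -> bool:
    """True if the downstream reading frame is unchanged."""
    return frame_offset(unit_length, delta_units) == 0


# Any change in a trinucleotide repeat keeps the frame;
# tetra- and pentanucleotide repeats do so only for changes in multiples of three units.
for unit in (3, 4, 5):
    for delta in (1, 2, 3, -1):
        print(unit, delta, preserves_frame(unit, delta))
```

Under these assumptions, only one in three possible copy-number changes of a tetra- or pentanucleotide repeat would leave a coding frame intact, whereas every change in a trinucleotide (or hexanucleotide) repeat does.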
An intriguing feature observed in multiple pentanucleotide repeat expansion disorders, including complex TTTTA and TTTCA repeats that cause benign adult familial myoclonic epilepsy (BAFME) in multiple genes (Ishiura et al., 2018), ATTTC repeats in spinocerebellar ataxia (SCA) type 31 (Sato et al., 2009), and AAGGG repeats that cause cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS) (Cortese et al., 2019;Tsuchiya et al., 2020), is that the pathogenic alleles represent non-reference STRs. That is, the repeat element is not only expanded in size, but it has a different sequence than the normal allele. For example, in CANVAS, an AAAAG pentanucleotide STR normally resides within the first intron of RFC1. However, the pathological repeat is a qualitatively different and expanded AAGGG repeat. Moreover, CANVAS can also occur with a third pentanucleotide repeat, ACAGG, at this same genomic location. In all these cases, these pentanucleotide repeats occur within the polyA region of an Alu transposable element. Active Alu transposition requires pure polyA elements at their 3' ends (Deininger, 2011). As such, there is strong evolutionary selection pressure favoring mutation of these regions to pure polyA sequences. This suggests that both the reference and non-reference STRs occurred initially through a protective process that disrupted the polyA element and prevented continued transposase activity. However, stochastic differences in these interrupting mutations created some STRs that were more prone to expansion, resulting in pathogenic alleles that either create toxic STR RNAs or that interfere with local gene expression.
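To illustrate what distinguishes a non-reference STR from a simple expansion, the sketch below measures the longest uninterrupted run of a given motif in a sequence, so that a reference motif (AAAAG) and a non-reference motif (AAGGG) can be compared at the same locus. The function name is hypothetical and the sequences are made-up toy strings, not real RFC1 alleles; real repeat genotyping from sequencing reads is considerably more involved.

```python
import re


def longest_pure_run(sequence: str, motif: str) -> int:
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    # A lookahead lets a run start at every position, so candidate runs that
    # overlap one another are all considered.
    pattern = re.compile(f"(?=((?:{re.escape(motif.upper())})+))")
    run_lengths = (len(m.group(1)) for m in pattern.finditer(sequence.upper()))
    return max(run_lengths, default=0) // len(motif)


# Toy alleles (assumed for illustration): a short reference-like tract and a
# tract in which a different motif has replaced and outgrown the reference motif.
reference_like = "TT" + "AAAAG" * 11 + "CC"
non_reference = "TT" + "AAAAG" * 3 + "AAGGG" * 40 + "AAAAG" * 2 + "CC"

for name, allele in (("reference-like", reference_like), ("non-reference", non_reference)):
    print(name, {motif: longest_pure_run(allele, motif) for motif in ("AAAAG", "AAGGG", "ACAGG")})
```

Counting by motif rather than by total tract length is the design choice that matters here: two alleles of similar size can differ in which motif dominates, which is the feature the text highlights.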

Conclusions and open questions
Native functions of STRs from an evolutionary perspective
Evolutionary pressures on STR copy number predict that repeat expansions will be either tolerated or selected for until an upper, deleterious limit is reached. If STRs were intrinsically deleterious, then there would be selective pressure towards repeat contractions, leading to global reductions in STR size or their selective elimination. However, several recent studies suggest that there is selective STR expansion across phylogeny, especially in primates. This is particularly true in 5' UTRs and coding regions, where constraints on repeat expansion and contraction are greatest. At the same time, many intergenic STRs show correlations between their size and the expression of neighboring genes. These eSTRs contribute meaningfully to population variance in gene expression profiles and disease-associated expression quantitative trait loci (eQTLs) in human populations (Fotsing et al., 2019;Gymrek et al., 2017). In general, these eSTRs are largely unconstrained unless neighboring or embedded within a gene already under strong constraint. This mutation-selection balance suggests some inherent native functions of STRs within transcript and protein space while also implying that intrinsic STR instability may allow for more rapid variation and acquisition of traits through local perturbations in gene expression than could be accomplished through single nucleotide mutations (Figure 3A). The highly variable methylation status, mRNA and protein expression patterns elicited by differences in FMR1 CGG repeat lengths typify the potential for repeat variation within populations to influence gene expression (Figure 3B). As even subtle changes in repeat size can tune gene expression and protein function and have downstream impacts on simple and complex phenotypes, they may be an important component of the genetic differences between humans and other species, and among humans themselves.

Revisiting our understanding of and approach to Repeat Expansion Disorders
Historically, repetitive elements within human genomes have been viewed as mostly unregulated 'junk DNA' that is not under selective evolutionary pressure. On this view, expansions of these repetitive elements are unfortunate accidents that become apparent and important only when they elicit highly penetrant and syndromic human diseases. Consistent with this line of reasoning, the field of REDs has largely focused on emergent toxic mechanisms as drivers of disease only in the setting of large STR expansions rather than considering their pathology as alterations in the native functions played by these repeats in their normal genomic contexts. Here, we propose re-framing the discussion around repetitive elements in general, and STRs in particular, within human genomes. For each STR, we suggest first considering whether the STRs associated with a human disease have any native functions at their 'normal' size. If a native function exists, then expansion of these STRs can be viewed primarily as an aberrancy of that native function with coincident predictable impacts on gene expression dysregulation above certain repeat lengths. This reframing aligns with the approach typically taken in studying gain-of-function and loss-of-function mutations in disease-associated single amino acid mutations and better ties the native functions of STRs to their pathology. It also suggests that shared regulatory rules will likely apply across REDs.
This approach to thinking about REDs leads to specific predictions. First, we predict that more REDs will be discovered in the future. For example, multiple recently described REDs are linked to CGG repeats, including neuronal intranuclear inclusion disease (NIID) (Ishiura et al., 2019;Sone et al., 2019), oculopharyngodistal myopathy (OPDM) and leukodystrophy (OPML) (Ishiura et al., 2019;Deng et al., 2020;Ogasawara et al., 2020;Tian et al., 2019), adult onset leukoencephalopathy (Okubo et al., 2019), and autism/intellectual disability (Annear et al., 2022). Most of these new CGG repeatopathies reside within the 5' UTRs, like the CGG repeat in FMR1, and there is already evidence of convergent disease mechanisms triggered by these new repeats with those already established in Fragile X disorders. In one particularly notable example, a CGG repeat expansion in NOTCH2NLC leads to the creation of an AUG-initiated upstream open reading frame in the 5' UTR that generates a polyglycine-containing protein akin to FMRpolyG in FMR1 (Liu et al., 2022;Boivin et al., 2021). This polyglycine protein is found within inclusions in patients with NIID and its generation is required to trigger inclusion formation and behavioral phenotypes in a mouse model of NOTCH2NLC-associated NIID. As such, we know that this motif in this location within neuronally expressed genes can elicit dysfunction through predictable mechanisms. This means that we should expect other CGG repeat expansions to emerge that mirror the pathologic processes established for the FMR1 locus and now extended to a large set of loci. Similarly, given evidence that the CGG repeat in FMR1 5'UTRs can serve as a functional element that regulates transcription, mRNA localization and translation, we predict that native CGG repeat elements in these disease-associated alleles may have normal functions akin to those observed for FMR1, and as such represent a functional motif shared among many genes.
Figure 3. The effects of STR variation on local gene expressivity. (A) Bi-allelic variation in a gene through single nucleotide polymorphisms often results in small and discrete differences in gene expression, offering limited phenotypic differences across a population with slow evolutionary timescales. In contrast, STRs in promoters and 5'UTRs can influence protein expression over a broader dynamic range, with an inverse correlation between repeat length and protein output within transcribed regions and with differential effects on transcription dependent on the repeat and local epigenetic context. Unstable repeats change rapidly from generation to generation (and even within an individual through somatic variation), creating a mechanism by which mRNA or protein expression can vary broadly and subtly across a population, offering greater genetic and phenotypic diversity and a greater propensity for disease-causing aberrancies at the extremes. (B) Predicted effects of CGG repeat length on FMR1 gene expression. CGG repeat length influences FMR1 promoter epigenetic state (more open chromatin with initial expansion, then DNA methylation and closed chromatin at >200 CGG repeats), FMR1 mRNA expression, and FMRP protein expression across the polymorphic range.
However, these new REDs may not all fit the typical model observed to date, where highly penetrant STR expansions lead to syndromic disorders. Instead, smaller changes in repeat size at multiple loci, impacting expression of the genes in which they reside or neighboring genes, will serve as risk alleles for common conditions. This risk-allele model is already apparent, as intermediate CAG repeat expansions in ATXN1, ATXN2, and HTT are associated with sporadic ALS and some other common neurodegenerative disorders (Elden et al., 2010;Rosas et al., 2020). Indeed, a fair proportion of the unexplained signal within genome-wide association studies (GWAS) can be explained by variations within neighboring STRs (Gymrek, 2017;Hannan, 2018). To date, numerous STR variants have been linked to ASD (Mitra et al., 2021;Trost et al., 2020) and schizophrenia (Mojarad et al., 2022). As PCR-free and long-read whole genome data become more abundant and available (reviewed in Mitsuhashi and Matsumoto, 2020), it will become increasingly easy to detect these dynamic repeat size/disease relationships, creating a whole new class of STR-associated conditions that will likely expand outside of neurological conditions.
Second, we predict that long-read whole genome sequencing datasets will improve our understanding of the native roles of STRs in humans, and reveal a ubiquitous impact of repeat length variation on gene expression. Once we create accurate maps of STR variation across the genome and link this variation to neighboring gene loci expression, we will be able to better discern the mechanisms by which STRs influence gene expression across cell types. We predict that many genes whose expression is affected by neighboring repeat length variation will play critical functions in the nervous system. Most known REDs present with neurological symptoms. If REDs reflect the native functions of STRs, then the overrepresentation of neurological dysfunctions linked to STR expansions suggests that STRs may play roles relevant to neuronal health and function. It is also possible that neurons, as terminally differentiated cells, may be more prone to somatic instability, leading to repeat expansion and the emergence of associated dysfunction with age.
Finally, we predict that the native functions of STRs will inform our understanding of how STR expansions cause disease and vice versa. A deeper understanding of the native functions of both disease-associated STRs and STRs in general will reveal the pathways altered in REDs, and these pathways may be areas for therapeutic intervention that can be applicable across all REDs. By studying the mechanisms by which STRs elicit disease, we can also surmise key elements of how they might function normally within nervous systems (see examples in previous section, "Mechanisms of STR toxicity reveal novel native functions of STRs"). Ultimately, research into native functions of STRs will reveal both mechanisms by which they regulate neuronal function and therapeutic targets by which their toxicity in REDs can be mitigated.